Abstract In this paper, we analyze an internal goal structure based on
heuristic dynamic programming, named GrHDP, to tackle the 2-D maze
navigation problem. Classical reinforcement learning approaches have been
applied to this problem in the literature, yet they assign no intermediate
reward before the final goal is reached. We integrate one additional
network, the goal network, into the traditional heuristic dynamic
programming (HDP) design to provide an internal reward/goal
representation. The architecture of the proposed approach is presented,
followed by a simulation of the 2-D maze navigation problem on a 10 × 10
maze. For a fair comparison, we use the same simulation environment
settings for the traditional HDP approach. Simulation results show that
the proposed GrHDP converges faster with respect to the sum of squared
error and also achieves a lower final error.

Keywords Goal representation heuristic dynamic programming (GrHDP) •
Maze navigation/path planning • Adaptive dynamic programming (ADP) •
Reinforcement learning (RL)

1  Introduction 

In the past decades, reinforcement learning (RL) and adaptive dynamic
programming (ADP) techniques have been frequently employed for prediction
and optimization, with the aim of finding the optimal control policy over
time. For instance, heuristic dynamic programming (HDP), dual heuristic
dynamic programming (DHP), and globalized dual heuristic dynamic
programming (GDHP) were proposed in Werbos (1992, 1990) to seek the
optimal control policy (the solution of Bellman's equation). Various
versions of ADP, such as the action-dependent (AD) designs and model-based
designs, have been developed, and their learning and control capabilities
have been demonstrated on various applications. More recently, high-level
discussions of ADP have addressed its fundamental principles for
optimization and learning over time Werbos (2013, 2008, 2009). For
instance, the online direct HDP was proposed and developed in Si and Wang
(2001), Yang et al. (2009), Liu et al. (2012), where the authors took
advantage of the potential scalability of the adaptive critic designs and
the intuitiveness of Q-learning. It is also an online learning scheme that
simultaneously updates the value function and the control policy. For
model-based DHP/GDHP designs, the convergence analysis in terms of the
cost function and the control law was demonstrated in Liu et al. (2012),
Wang et al. (2012), Liu and Wei (2013). In addition, the performance of
HDP, DHP and GDHP was compared on the autolander helicopter problem in
Prokhorov and Wunsch (1997), Prokhorov (1997), Prokhorov et al. (1995).
Recent research books provide an in-depth overview of RL and ADP, covering
both stability/convergence analysis and a variety of complex industrial
applications Si et al. (2004), Lewis and Liu (2013).

Recent papers exploring the internal reward (goal) have demonstrated its
significance in the ADP/RL communities. It has been proposed and
demonstrated in He et al. (2011, 2012a,b) that a three-network
architecture can achieve better control performance than the traditional
ADP design on several balancing benchmarks. In addition, a hierarchical
HDP design is presented in Ni et al. (2012a,b), He et al. (2012c), with
significant improvement in the average number of successful trials
compared with both the three-network design and the traditional ADP
design. Furthermore, it was shown in Fu et al. (2011a,b,c) that the
performance of an ADP controller on complex industrial systems can be
improved by a second-order learning algorithm. In addition, a stability
analysis of the dual-critic design with a tracking controller has been
provided and verified on numerical simulation benchmarks in Ni et al.
(2013). Real-time tracking control with this dual-critic design on a
virtual reality/Simulink platform is also presented in Ni et al. (2013),
Fang et al. (2012).

Maze navigation is a typical Markov decision process (MDP) benchmark and
has been widely tested in the ADP/RL community. For instance, in Pang and
Werbos (1996), Wunsch (2000), it was proposed to learn the value table
with adaptive-critic designs in a closed-loop form with a simultaneous
recurrent neural network (SRN). In Ilin et al. (2006, 2007, 2008), the
authors proposed to improve the learning process on the same problem by
integrating a cellular SRN and a Kalman filter into the ADP design.
Furthermore, in Wiering and Van Hasselt (2007), the authors compared
classical Q-learning, Sarsa(λ), the conventional actor-critic design and
the proposed QV-learning on the maze navigation benchmark, and showed an
improved learning process with the proposed approach. Although recent
advances in ADP research have produced many critical applications across
different domains, it has been recognized that the 2-D maze navigation
problem remains a significant challenge for the community Werbos and Pang
(1996).

In this paper, we extend our previous work on goal representation design
for MDP benchmarks Ni et al. (2013), and focus on the comparison between
the proposed approach and the traditional HDP approach. As a proof of
concept, we use the same simulation environment settings for both
approaches and adopt the gradient descent method as the learning
algorithm. Specifically, there are three neural networks in our proposed
goal representation HDP (GrHDP) approach: an action network, a critic
network, and a goal network. The motivation is to represent a detailed
goal signal that can be tuned adaptively and efficiently according to the
system state. In this way, rather than relying on the discrete
reinforcement signal of the traditional ADP and RL approaches, our goal
network automatically provides an internal goal signal based on the
(external) reinforcement signal, to achieve optimal action selection. For
a fair comparison, we evaluate our proposed GrHDP approach and the regular
HDP approach with the same parameter settings. The learning curves show
that GrHDP and HDP can both learn the value table online. However, the
proposed GrHDP approach not only converges faster, but also achieves a
lower sum of squared error in the end.

The rest of the paper is organized as follows: Sect. 2 shows the
architecture design of our proposed GrHDP framework on the maze navigation
problem, and also provides the learning algorithm for the goal network.
Simulation results on GrHDP and HDP under the same environment settings
are presented and compared in Sect. 3. Finally, the conclusion is provided
in Sect. 4.


2  GrHDP  structure  for  maze  navigation 


We provide the interaction diagram between the proposed GrHDP design and
the maze/environment in Fig. 1. The action network observes the system
state from the maze/environment and provides the action based on the
current state. The environment then provides the (external) reward based
on the performance of the corresponding action. For the HDP controller
(i.e. the middle part of Fig. 1), we keep the design similar to the
traditional HDP in Si and Wang (2001). That is, we adopt the model-free
action-dependent (AD) design for our GrHDP and use the gradient descent
algorithm to train all the neural networks. Instead of relying only on the
traditional (external) discrete reward assignment in maze navigation, our
proposed GrHDP design integrates a goal network that learns from the
(external) reward r and provides the critic network with a detailed
internal reward s. In this paper, we define the (external) reward as


$$
r = \begin{cases}
1, & \text{reach the goal} \\
-0.2, & \text{out of bound} \\
0, & \text{regular move}
\end{cases}
\tag{1}
$$


This reward assignment is also illustrated on the maze in the lower part
of Fig. 1.
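
As a minimal sketch of the reward assignment in (1), the following Python function returns the external reward for the cell the agent has just moved to. The function name, the 1-based coordinate convention, and the default 10 × 10 size with the goal at [10, 10] are illustrative assumptions rather than the authors' implementation.

```python
def external_reward(x1, x2, length=10, width=10, goal=(10, 10)):
    """External reward r of Eq. (1) for the cell (x1, x2) the agent lands in.

    Assumes 1-based cell coordinates (a hypothetical convention for this sketch).
    """
    if (x1, x2) == goal:
        return 1.0    # reach the goal
    if not (1 <= x1 <= length and 1 <= x2 <= width):
        return -0.2   # out of bound
    return 0.0        # regular move
```

Only the two terminal events carry a non-zero reward, which is why an agent that learns from r alone receives no feedback during regular moves.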

We employ a typical multi-layer perceptron (MLP) structure for all the
neural networks here. In order to closely connect the goal network with
the critic network, we set the internal reward s to be one of the inputs
of the critic network. Therefore, the inputs of the goal network and the
critic network can be denoted as $x_g = [X, u]$ and $x_c = [X, u, s]$,
respectively. The input of the action network is the current system state
vector, and the outputs of the action network correspond to the four
directions (i.e. the output of the action network is a 4 × 1 vector). To
highlight our original contribution, we discuss only the learning
algorithm of the goal network in detail and briefly provide the error
(objective) functions for the critic network and the action network.


2.1  Learning  in  goal  network 

Fig. 1 The proposed GrHDP framework on maze navigation. Two dashed lines
separate the diagram into three parts: goal representation, traditional
HDP design, maze navigation benchmark

In the literature, the instant reward in the maze navigation problem is
generally set to 0 unless the agent reaches the goal Mitchell (1997),
Sutton and Barto (1998). In recent years, there has been growing interest
in whether assigning a non-zero instant reward to the agent during the
learning process brings any improvement He (2011), He et al. (2012c).
Various reward/cost functions have been defined for different
applications; however, such functions are strongly domain-oriented, and it
is difficult to define a proper one in general. Therefore, it is desirable
to find a general reward/cost function that can learn and self-adjust in
various environments. In this paper, we propose to build a general mapping
between the system state (including the control action) and the internal
goal signal by using a neural network. In addition, we integrate this
network into the HDP framework. The internal goal signal can then be
represented as


$$
s = f(X, u). \tag{2}
$$

The motivation of this design is to introduce the goal network to
represent the internal reward/goal and to approximate the discounted total
future reward. Thus we define the internal reward/goal as

$$
s(k) = r(k+1) + \gamma r(k+2) + \gamma^2 r(k+3) + \cdots \tag{3}
$$

where $\gamma$ is the discount factor and $r$ is the (external) reward
signal defined in (1). Here $r(k+1)$, $r(k+2)$, $r(k+3)$, ... are the
future reward signals.

Therefore, the error function for the goal network is defined as

$$
e_g(k) = \gamma s(k) - [s(k-1) - r(k)] \tag{4}
$$

and

$$
E_g(k) = \frac{1}{2} e_g^2(k). \tag{5}
$$

The state vector is $X = [x_1, x_2, \ldots, x_n]$, where $n$ is the number
of elements in the state vector, and the control action is
$u = [u_1, u_2, \ldots, u_m]$, where $m$ is the number of elements in the
control vector. The input vector of the goal network is defined as
$x_g = [X, u]$ and the output is the internal goal $s$, which is a scalar
here. The sigmoid function is defined as

$$
\phi(x) = \frac{1 - e^{-x}}{1 + e^{-x}} \tag{6}
$$

to constrain the output into $[-1, 1]$. Here the sigmoid function is
applied on all hidden nodes and the output node, as presented in Fig. 2.
The forward paths of the goal network are as follows:

$$
\begin{aligned}
s(k) &= \phi(l(k)), \\
l(k) &= \sum_{i=1}^{N_{gh}} \omega_{g_i}^{(2)}(k)\, y_i(k), \\
y_i(k) &= \phi(z_i(k)), \quad i = 1, \ldots, N_{gh}, \\
z_i(k) &= \sum_{j=1}^{n} \omega_{g_{i,j}}^{(1)}(k)\, x_j(k)
        + \sum_{j=n+1}^{m+n} \omega_{g_{i,j}}^{(1)}(k)\, u_{j-n}(k),
\end{aligned}
\tag{7}
$$

where $z_i$ and $y_i$ refer to the input and the output of the $i$-th
hidden node, and $l$ is the input of the output node. $\omega_g^{(1)}$ and
$\omega_g^{(2)}$ denote the weights from the input to the hidden layer and
from the hidden to the output layer of the goal network, respectively.
$N_{gh}$ is the number of hidden nodes in the goal network. These
parameters are marked in Fig. 2, where readers can follow the forward
paths in (7) for the goal network.

Fig. 2 MLP structure of goal network. Sigmoid function is applied for
both hidden nodes and output nodes

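We adopt the gradient descent method to minimize the approximation error
in (5). The weights from the hidden to the output layer are tuned
according to

$$
\frac{\partial E_g(k)}{\partial \omega_{g_i}^{(2)}(k)}
= \frac{\partial E_g(k)}{\partial s(k)}
  \frac{\partial s(k)}{\partial l(k)}
  \frac{\partial l(k)}{\partial \omega_{g_i}^{(2)}(k)}. \tag{8}
$$

The weights from the input to the hidden layer are tuned according to

$$
\frac{\partial E_g(k)}{\partial \omega_{g_{i,j}}^{(1)}(k)}
= \frac{\partial E_g(k)}{\partial s(k)}
  \frac{\partial s(k)}{\partial l(k)}
  \frac{\partial l(k)}{\partial y_i(k)}
  \frac{\partial y_i(k)}{\partial z_i(k)}
  \frac{\partial z_i(k)}{\partial \omega_{g_{i,j}}^{(1)}(k)}. \tag{9}
$$

The weights are tuned in the order of goal network, critic network and
action network. After the weights in the goal network are tuned, we fix
these weights and start to tune the weights in the critic network and the
action network.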
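
The forward paths and the two chain-rule gradients of the goal network can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' code: the bipolar sigmoid `phi`, the discount factor 0.95, the five hidden nodes, the sample input, and all function names are assumptions, and s(k-1) is treated as a constant when differentiating the error. Only the learning rate 0.012 and the weight range [-0.3, 0.3] are taken from the paper's settings.

```python
import math
import random


def phi(x):
    """Bipolar sigmoid with outputs in (-1, 1) (assumed activation)."""
    return (1.0 - math.exp(-x)) / (1.0 + math.exp(-x))


def dphi_from_output(y):
    """Derivative of phi expressed through its output y = phi(x)."""
    return 0.5 * (1.0 - y * y)


def forward(w1, w2, x_g):
    """Forward pass: hidden inputs z_i, hidden outputs y_i, then l and s."""
    z = [sum(w * x for w, x in zip(row, x_g)) for row in w1]
    y = [phi(z_i) for z_i in z]
    l_in = sum(w * y_i for w, y_i in zip(w2, y))
    return y, phi(l_in)


def goal_error(s_k, s_prev, r_k, gamma):
    """e_g(k) = gamma * s(k) - [s(k-1) - r(k)]."""
    return gamma * s_k - (s_prev - r_k)


def goal_update(w1, w2, x_g, s_prev, r_k, gamma=0.95, lr=0.012):
    """One gradient-descent step on E_g = 0.5 * e_g^2 (weights changed in place).

    Returns E_g evaluated before the step.
    """
    y, s = forward(w1, w2, x_g)
    e = goal_error(s, s_prev, r_k, gamma)
    # dE/dl = (dE/ds) * (ds/dl), with dE/ds = gamma * e_g.
    delta_out = gamma * e * dphi_from_output(s)
    # Hidden-layer deltas use the hidden-to-output weights from before the update.
    delta_hidden = [delta_out * w2[i] * dphi_from_output(y[i])
                    for i in range(len(w2))]
    for i in range(len(w2)):
        w2[i] -= lr * delta_out * y[i]
        for j in range(len(x_g)):
            w1[i][j] -= lr * delta_hidden[i] * x_g[j]
    return 0.5 * e * e


# Usage: repeat the update on one fixed (hypothetical) input x_g = [X, u].
random.seed(0)
n_in, n_hidden = 6, 5
w1 = [[random.uniform(-0.3, 0.3) for _ in range(n_in)] for _ in range(n_hidden)]
w2 = [random.uniform(-0.3, 0.3) for _ in range(n_hidden)]
x_g = [0.4, 1.2, 0.1, -0.2, 0.3, 0.0]
errors = [goal_update(w1, w2, x_g, s_prev=0.5, r_k=0.0) for _ in range(25)]
```

Repeating the step on a fixed input lowers E_g from one iteration to the next, which is the behaviour the inner training loop of the goal network relies on.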

2.2  Learning  in  critic  network  and  action  network 

The critic network and the action network in our design are similar to the
existing designs in Si and Wang (2001), Yang et al. (2009), He and
Jagannathan (2007). We introduce one more input, namely the internal
reward signal s, to the critic network in order to help the value function
approximation. In addition, the error (objective) function of the critic
network differs from those in existing designs. As the critic network is
set to approximate the discounted total future internal reward/goal s with
the value function J, we can write the value function as

$$
J(k) = s(k+1) + \gamma s(k+2) + \gamma^2 s(k+3) + \cdots \tag{10}
$$

Then the error function of the critic network can be defined as

$$
e_c(k) = \gamma J(k) - [J(k-1) - s(k)] \tag{11}
$$

and

$$
E_c(k) = \frac{1}{2} e_c^2(k). \tag{12}
$$

We apply the same gradient descent algorithm as above to minimize $E_c$.
Once the weight tuning in the critic network is finished, we start the
online learning of the action network.
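
The error functions (4) and (11) share the same temporal-difference form. As an informal check, the sketch below builds the exact discounted sums of (3) for a short reward sequence and confirms that the error vanishes for them, since s(k-1) = r(k) + γ s(k) holds in that case. The reward sequence, the value γ = 0.9, and the helper names are made-up illustrative assumptions.

```python
def discounted_future_sum(signal, k, gamma):
    """signal[k+1] + gamma * signal[k+2] + ... over a finite sequence."""
    return sum(gamma ** i * v for i, v in enumerate(signal[k + 1:]))


def td_error(value_k, value_prev, signal_k, gamma):
    """Common form of (4) and (11): gamma * V(k) - [V(k-1) - signal(k)]."""
    return gamma * value_k - (value_prev - signal_k)


# Hypothetical external rewards: regular moves, then the goal is reached at k = 5.
r = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
gamma = 0.9  # assumed discount factor
s = [discounted_future_sum(r, k, gamma) for k in range(len(r))]
errors = [td_error(s[k], s[k - 1], r[k], gamma) for k in range(1, len(r))]
```

A network output that deviates from these discounted sums produces a non-zero error, and gradient descent on the squared error pushes the output back towards them. The same reasoning applies to the critic network, with J in place of s and s in place of r.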

As the objective of the action network is to maximize the total reward, we
define the error function of the action network as

$$
e_a(k) = J(k) - U_c \tag{13}
$$

and

$$
E_a(k) = \frac{1}{2} e_a^2(k), \tag{14}
$$

where $U_c$ is the ultimate utility function. As with the critic network,
we adopt the gradient descent algorithm to minimize (14).

2.3  Learning  on  maze  navigation 

In the maze navigation benchmark, we assume that the agent starts each
trial from the initial state given by the updating sequence (each updating
sequence is assumed to visit all the states enough times). The proposed
GrHDP controller learns to provide the action based on the position of the
agent. We apply the winner-take-all (WTA) method to determine the
direction in which the agent moves. The goal network then provides the
internal goal based on the direction and the updated state of the agent.
This internal goal signal is one of the inputs of the critic network,
which evaluates the performance of the agent for the corresponding action.
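
A minimal sketch of the winner-take-all step follows, assuming the four action-network outputs are ordered as up, down, left, right, that "up" increases the second coordinate (the goal sits in the upper-right corner), and that ties are broken in favour of the first maximum. The helper names are hypothetical.

```python
DIRECTIONS = ("up", "down", "left", "right")  # assumed order of the four outputs
OFFSETS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def winner_take_all(u):
    """Return the index of the largest action-network output."""
    return max(range(len(u)), key=lambda i: u[i])


def move(x1, x2, direction):
    """Apply one step in the chosen direction to the position (x1, x2)."""
    dx, dy = OFFSETS[direction]
    return x1 + dx, x2 + dy


# Usage with a made-up output vector of the action network.
u = [0.1, -0.3, 0.7, 0.2]
chosen = DIRECTIONS[winner_take_all(u)]
new_position = move(3, 4, chosen)
```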

In our learning process, we check after every action whether the agent is
out of bound or has reached the goal. If the agent is out of bound, we
assign the punishment and start another trial. If the agent reaches the
goal, we assign the reward and regard this as the end of the trial.
Otherwise, the agent keeps moving forward (i.e. we adopt infinite-step
look-ahead). We update the J(x, u) value table after each trial, and
compare the learned value table with the reference value table to show the
learning process. We terminate the learning process when the number of
trials reaches the maximum number we assign. In this simulation, we carry
out independent runs to show that the learning process can be reproduced.
We report the learning curves and value tables as averages of the results
over the different runs.

3  Simulation  results  and  analysis 

3.1  Algorithm  implementation 

The environment of the 2-D maze navigation is presented in the lower part
of Fig. 1. In this simulation, we denote the instant reward for moving
between two states x and x' by taking the action u as r(x, x') or r(x, u).
We assume that there are N possible states in the maze. The transition
probability $P(x' \mid x, u)$ between two states x and x' can only take
the value 0 or 1 (i.e. the maze navigation problem here is a deterministic
and finite MDP). Thus, Bellman's equation can be rewritten as

$$
J^*(x, u) = \max_{u'} \Big( r(x, u)
  + \gamma \sum_{j=1}^{N} P(x'_j \mid x, u)\, J^*(x'_j, u') \Big) \tag{15}
$$

where $J^*(x, u)$ is the maximum total reward at state x by taking the
action u, and $x'_j$ ranges over the N states.
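
To make the deterministic Bellman backup in (15) concrete, the sketch below runs plain tabular value iteration on the maze. This is not the neural-network learning scheme studied in the paper; it only illustrates the fixed point that a value table approximates. The discount factor 0.95, the number of sweeps, the treatment of the goal and out-of-bound moves as terminal events with no future value, and all names are assumptions.

```python
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def value_iteration(length=10, width=10, goal=(10, 10), gamma=0.95, sweeps=50):
    """Tabular J*(x, u) for a deterministic maze with the rewards of Eq. (1)."""
    states = {(a, b) for a in range(1, length + 1) for b in range(1, width + 1)}
    J = {(x, u): 0.0 for x in states for u in MOVES}
    for _ in range(sweeps):
        for (x, u) in J:
            if x == goal:
                continue  # the trial has already ended at the goal
            nxt = (x[0] + MOVES[u][0], x[1] + MOVES[u][1])
            if nxt == goal:
                J[(x, u)] = 1.0    # r = 1, trial ends
            elif nxt not in states:
                J[(x, u)] = -0.2   # r = -0.2, trial ends
            else:
                # r = 0 plus the discounted best value at the unique successor
                J[(x, u)] = gamma * max(J[(nxt, v)] for v in MOVES)
    return J


J_star = value_iteration()
best_from_start = max(MOVES, key=lambda u: J_star[((1, 1), u)])
```

Because each state-action pair has exactly one successor, the sum over the N states in (15) collapses to a single term, which is what the last branch of the loop computes.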

In this maze navigation benchmark, our objective is to employ learning
algorithms to learn the value table online, so that the agent can move in
the direction that maximizes the total reward (towards the goal location).
For a fair comparison, we run our proposed GrHDP and the traditional HDP
algorithm with the same environment settings and initial parameters. The
algorithms and parameter settings for both approaches are summarized as
follows:

1. HDP: The online model-free HDP proposed in Si and Wang (2001) is used
here. The initial learning parameters are set as $l_c = 0.005$ and
$l_a = 0.01$, where $l_c$ and $l_a$ refer to the learning rates of the
critic network and the action network, respectively. The stopping criteria
are $N_c = 20$, $N_a = 30$, $T_c = 10^{-4}$ and $T_a = 10^{-4}$. That is,
the learning process of the critic/action network is terminated either
when the error drops below the threshold $T_c$/$T_a$ or when the number of
iterations reaches the threshold $N_c$/$N_a$.

2. GrHDP: The same parameters are applied to the networks that GrHDP
shares with the HDP approach above. In addition, the initial parameters
for the goal network are $l_g = 0.012$, $T_g = 10^{-4}$ and $N_g = 25$.
Furthermore, for a fair comparison, GrHDP and HDP start with the same
initial weights in [-0.3, 0.3] and the same updating sequence.
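
The stopping rule shared by the three networks (stop when the error drops below the threshold T or when the iteration count reaches N) can be sketched as a small helper. The helper name, the dictionary layout, and the toy update used in place of a real network step are hypothetical; the thresholds and iteration limits are the ones listed above.

```python
def train_until(step_fn, threshold, max_iters):
    """Call step_fn() until its returned error is below threshold
    or max_iters calls have been made. Returns (iterations, last_error)."""
    error = float("inf")
    iterations = 0
    while iterations < max_iters:
        error = step_fn()
        iterations += 1
        if error < threshold:
            break
    return iterations, error


CRITIC = {"threshold": 1e-4, "max_iters": 20}  # Tc, Nc
ACTION = {"threshold": 1e-4, "max_iters": 30}  # Ta, Na
GOAL = {"threshold": 1e-4, "max_iters": 25}    # Tg, Ng

# Toy stand-in for a network update: gradient descent on E(w) = 0.5 * w**2.
state = {"w": 1.0}


def toy_step(lr=0.5):
    state["w"] -= lr * state["w"]
    return 0.5 * state["w"] ** 2


iters, err = train_until(toy_step, **CRITIC)
```

In this toy case the error threshold is reached after 7 iterations, well before the limit of 20; a step whose error never falls below the threshold would instead stop at the iteration limit.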

3.2  Simulation  setup  and  study 

In the simulation, we assume that (1) every state in the maze has been
visited enough times; (2) every action (up, down, left, right) has been
taken enough times for each state; (3) for every initial state, the agent
can go infinitely many steps forward unless it reaches the goal or hits
the bound. The input of the action network is the current state vector

$$
X = [x_1, x_2]. \tag{16}
$$

The inputs of the goal network and the critic network are

$$
x_g = [x_1, x_2, u_1, u_2, u_3, u_4] \tag{17}
$$

and

$$
x_c = [x_1, x_2, u_1, u_2, u_3, u_4, s], \tag{18}
$$

respectively. In this benchmark study, we define the system state and
control action as follows:

$x_1$: the coordinate on the horizontal (x) axis;
$x_2$: the coordinate on the vertical (y) axis;
$u_1$: the direction up;
$u_2$: the direction down;
$u_3$: the direction left;
$u_4$: the direction right.

We set $U_c = 1$ and normalize the inputs of the action network to lie in
[0, 2]. We generate 10 independent updating sequences for 10 runs (i.e.
the updating sequence in each run is independent). Each run includes 500
trials, and each trial starts with the initial state loaded from the
updating sequence. We use infinite-step look-ahead in our simulation
studies, so a trial terminates only when the agent reaches the goal or
hits the bound. Therefore, the number of steps that the agent moves is not
necessarily the same in each trial. The J(x, u) table is initialized to
all zeros at the very beginning and is updated only after the agent
finishes each trial. We then normalize the J(x, u) values to lie in [0, 1]
to show the difference from the reference value table.

Both our proposed GrHDP approach and the traditional HDP approach start
with the same initial weights (uniformly initialized in [-0.3, 0.3]) and
the same updating sequence. The learning rates and internal stopping
criteria for both approaches are also set to be the same. An adaptive
learning rate (ALR) is used in our simulation. The initial learning rates
for the action network, the critic network and the goal network are set to
5e-3, 1e-2 and 1.2e-2, respectively, and they are divided by 2 every 10
trials. If a learning rate falls below 1e-10 after dividing, we keep it at
1e-10 thereafter. In addition, we keep a counter of the four
actions/directions taken in every state. For a specific state, if any
action (i.e. up, down, left, right) has been taken more than a preset
number of times (30 in this case study), we randomly pick another
direction from the remaining choices as the final decision. In this way,
all the directions are tried enough times for the agent to learn from both
failure and success.
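
The learning-rate schedule and the action counter described above can be sketched as follows. The function names, the 0-based trial index, and the exact point at which the cap takes effect (the 31st request for the same action in the same state) are assumptions made for illustration.

```python
import random


def adaptive_lr(initial_lr, trial, period=10, floor=1e-10):
    """Learning rate for a 0-based trial index: halved every `period` trials
    and held at `floor` once it would fall below it."""
    return max(initial_lr / (2 ** (trial // period)), floor)


def apply_action_cap(state, chosen, counts, cap=30, rng=random):
    """Replace `chosen` by a random other direction once it has already been
    taken `cap` times in `state`; `counts` maps (state, action) to a count."""
    if counts.get((state, chosen), 0) >= cap:
        chosen = rng.choice([a for a in range(4) if a != chosen])
    counts[(state, chosen)] = counts.get((state, chosen), 0) + 1
    return chosen


# Usage: the goal-network rate over a 500-trial run, and a capped action.
goal_rates = [adaptive_lr(1.2e-2, t) for t in range(500)]
counts = {}
picks = [apply_action_cap((1, 1), 0, counts) for _ in range(31)]
```

With an initial rate of 1.2e-2, the floor of 1e-10 is reached after 27 halvings, so the rate stays constant from trial 270 onwards in this sketch.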

In this simulation study, we introduce a Q reference value table based on
the distance between the current location and the goal, as in Ilin et al.
(2008), Ni et al. (2013). The values of the states that are one step from
the goal are assigned to be 1, and the values of the other states drop by
$1/(L + W)$ for each additional step, where L and W refer to the length
and width of the maze, respectively. For a maze size of 10 × 10, the
difference between successive steps is therefore 0.05. We define the Q
reference table as

$$
Q_{ref}(x_1, x_2) = 1 - \frac{1}{L + W} \cdot (L - x_1 + W - x_2 - 1). \tag{19}
$$
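
As a short worked example of (19), the sketch below evaluates the reference table for the 10 × 10 maze, assuming 1-based coordinates and a per-step drop of 1/(L + W), which matches the stated difference of 0.05. The function and variable names are hypothetical.

```python
def q_ref(x1, x2, length=10, width=10):
    """Reference value of Eq. (19) for cell (x1, x2), 1-based coordinates."""
    return 1.0 - (length - x1 + width - x2 - 1) / (length + width)


# Full reference table, indexed as table[x2 - 1][x1 - 1].
table = [[q_ref(x1, x2) for x1 in range(1, 11)] for x2 in range(1, 11)]
```

Cells one step from the goal, such as (10, 9) and (9, 10), evaluate to 1, and each further step away lowers the value by 0.05, down to 0.15 at (1, 1).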
Furthermore, we define the sum of squared error as

$$
E_{sum} = \frac{1}{2} \sum_{x_1 = 1,\, x_2 = 1}^{x_1 = L,\, x_2 = W}
  \big( J(x_1, x_2) - Q_{ref}(x_1, x_2) \big)^2. \tag{20}
$$

In this case study, we set the maze size to 10 × 10 and the goal is
located at [10, 10]. The learning curves are presented in Fig. 3, where
the x axis refers to the trial number and the y axis refers to the sum of
squared error. Note that all the curves presented here are the average
values of 10 independent runs.

From Fig. 3, we can see that both approaches start with the same sum of
squared errors, as we initialize J(x, u) to be all zero at the very
beginning. Once the agent starts to move, the proposed GrHDP approach
converges faster with regard to the sum of squared errors. Moreover, the
proposed GrHDP approach also achieves a lower steady-state error than the
HDP approach. In addition, we provide the histogram of the internal goal s
over 10 independent runs in Fig. 4. The statistical results show that the
internal goals lie within a range that is almost symmetric about zero.
Furthermore, we provide the value table learned with the proposed GrHDP
approach in Fig. 5, in which the values are the average of 10 runs. The
values tend to become larger as the agent approaches the goal in the
upper-right corner. The surface plot of this value table (i.e. the same
value table as in Fig. 5) is presented in Fig. 6, where the x-axis and
y-axis refer to the coordinates of the agent and the z-axis refers to the
corresponding J value. The surface is smooth, and the value increases from
the origin (i.e. [0, 0]) to the goal location (i.e. [10, 10]).

Fig. 3 The learning curves for the sum of squared errors with GrHDP and
HDP approaches, respectively

Fig. 4 The histogram of internal goal in ten independent runs. Ten
different colors are adopted to represent ten independent runs

Fig. 5 The value table for the 10 × 10 maze. The value in each cell refers
to the max value that the agent can obtain among the four possible
directions

Fig. 6 The surf plot of the value table for the 10 × 10 maze. The value of
the goal location has been set to 1 in this plot

In addition, we have compared the computational cost of both approaches.
Here we count only the computation time of the learning algorithms (i.e.
only the weight tuning is counted in both approaches). As the learning
procedure is concentrated in the first 200 trials, we compare the time
cost of both approaches in this region. Simulation results show that the
proposed GrHDP approach requires 0.023 s per trial, compared with 0.016 s
per trial for the HDP approach (simulations are conducted on a Sun server
with 16 GB memory, an Intel Xeon 3.60 GHz CPU and Matlab R2013a). The
proposed GrHDP approach includes one more neural network and therefore
takes additional memory space for the goal network. Our key interests in
this paper are the convergence speed and the optimal policy, for which the
proposed approach achieves much better performance than the regular HDP
approach.


4  Conclusions 

In this paper, a goal representation heuristic dynamic programming
approach is introduced and analyzed for a classical maze navigation
benchmark. We studied the GrHDP architecture design and its learning
algorithms. In order to demonstrate the improved performance, we compared
the learning results of the proposed GrHDP approach with those of the
traditional HDP approach on a 10 × 10 maze navigation benchmark under the
same environment settings. The learning curves and the value table confirm
the improved performance compared with the traditional HDP approach.